STA 360 Lab 3: Posterior Predictive Checks

Data

The data we have are counts of deaths per episode of Game of Thrones.

## # A tibble: 6 x 3
##   Season Episode Count
##    <dbl>   <dbl> <dbl>
## 1      1       1     4
## 2      1       2     3
## 3      1       3     0
## 4      1       4     1
## 5      1       5     5
## 6      1       6     4

A quick histogram of the data:

Because we have count data, it is natural to model them with a Poisson distribution. If we take the empirical mean of the data as our estimate \(\hat{\lambda}\) of the Poisson rate, what does the sampling distribution look like?
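As a sketch of this check (using hypothetical stand-in counts, since the full data set is not reproduced here), the simulation might look like:

```r
# Hypothetical stand-in counts; the lab uses the full Game of Thrones data
y <- c(4, 3, 0, 1, 5, 4, 0, 2, 7, 0)

lambda_hat <- mean(y)             # plug-in estimate: the empirical mean

set.seed(360)
y_sim <- rpois(1000, lambda_hat)  # draws from Poisson(lambda_hat)

# Compare the proportion of zeros in the data and in the simulation
mean(y == 0)
mean(y_sim == 0)
```

Even with made-up counts like these, a Poisson fit tends to produce far fewer zeros than data that are genuinely zero-heavy.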

Plotting the histogram of the data next to the histogram of the sampled data using \(\hat{\lambda} = \bar{y}\), we see that there are many more 0-valued observations in the observed data than there are in the simulated data. We suspect that the Poisson model will not be a good fit for the data, but we will proceed anyway.

Fit Poisson model

We have learned that the Gamma distribution is conjugate for Poisson data. Here, we have chosen the prior \(\lambda \sim Gamma(10,2)\) based on prior knowledge that Game of Thrones is a show filled with death. We run the model in Stan. We save the posterior sampled values of \(\lambda\) in the variable `lambda_draws`.
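Because the Gamma prior is conjugate, the posterior is also available in closed form, which gives a useful check on the Stan output: with \(\lambda \sim Gamma(a, b)\) and \(n\) Poisson observations, \(\lambda \mid y \sim Gamma(a + \sum y_i,\; b + n)\). A sketch with hypothetical stand-in counts:

```r
# Hypothetical stand-in counts; the lab uses the full data set
y <- c(4, 3, 0, 1, 5, 4, 0, 2, 7, 0)
a <- 10; b <- 2                   # Gamma(10, 2) prior on lambda

a_post <- a + sum(y)              # conjugate update: Gamma(a + sum(y), b + n)
b_post <- b + length(y)

a_post / b_post                   # closed-form posterior mean

set.seed(360)
lambda_draws <- rgamma(2000, shape = a_post, rate = b_post)
mean(lambda_draws)                # should agree with the closed form
```

The Monte Carlo draws here mimic the role of `lambda_draws` from the Stan fit; with the real data the posterior mean should match the Stan summary closely.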


We use the function mcmc_areas() to plot the smoothed posterior distribution for \(\lambda\). The line represents the posterior mean, and the shaded area represents 90% of the posterior density. You can play around with changing the `prob` parameter in mcmc_areas(). We also report the posterior mean and 90% HPD interval using this Poisson-Gamma model.

## Inference for Stan model: poisson-simple.
## 2 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=2000.
## 
##        mean se_mean   sd 2.5%  25% 50%  75% 97.5% n_eff Rhat
## lambda    4    0.01 0.23 3.59 3.84   4 4.15  4.48   820    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Jul 12 12:29:30 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
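The 90% HPD interval mentioned above can also be computed directly from the posterior draws. A minimal sketch (the helper `hpd()` below is our own, not from a package; the lab itself may use something like coda's HPDinterval):

```r
# Shortest interval containing `prob` of the sorted draws --
# a simple empirical HPD for a unimodal posterior
hpd <- function(draws, prob = 0.9) {
  sorted <- sort(draws)
  n <- length(sorted)
  k <- ceiling(prob * n)               # number of draws the interval must cover
  widths <- sorted[k:n] - sorted[1:(n - k + 1)]
  i <- which.min(widths)               # left endpoint of the shortest interval
  c(lower = sorted[i], upper = sorted[i + k - 1])
}

set.seed(1)
draws <- rgamma(2000, shape = 36, rate = 12)  # stand-in posterior draws
interval <- hpd(draws, prob = 0.9)
interval
```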

How does the posterior mean for \(\lambda\) compare to the sample mean?

Graphical Posterior Predictive Checks (PPC)

Let us generate posterior predictive samples using the posterior values of \(\lambda\).
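One way to generate them (a sketch: the stand-in `lambda_draws` below plays the role of the posterior draws from Stan, and 73 is the number of observed episodes):

```r
set.seed(360)
# Stand-in posterior draws; in the lab these are `lambda_draws` from Stan
lambda_draws <- rgamma(2000, shape = 36, rate = 12)
n <- 73   # number of observed episodes

# Each row of y_rep is one replicated data set, drawn from
# Poisson(lambda) with lambda a single posterior draw
y_rep <- t(sapply(lambda_draws, function(l) rpois(n, l)))
dim(y_rep)   # 2000 x 73
```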

We now step through some graphical posterior predictive checks. Here we compare the histogram of the observed counts to histograms of several of the posterior predictive data sets. What do you notice?

We can compare the density estimate of \(y\) to the density estimates for several (60) of the \(y^{rep}\) data sets. What do you notice here?

In our first histogram of the data, we noticed the high number of 0 counts. Let us calculate the proportion of zeros in \(y\), and compare that to the proportion of zeros in all of the posterior predictive samples.

## [1] 0.1643836
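In code, this check might look like the following sketch (hypothetical stand-in data; in the lab, `y` and `y_rep` are the observed counts and the predictive draws, which give the proportion printed above):

```r
prop_zero <- function(x) mean(x == 0)     # test statistic T(y)

set.seed(360)
# Hypothetical stand-in data and replicated data sets
y <- c(4, 3, 0, 1, 5, 4, 0, 2, 7, 0)
y_rep <- t(sapply(rgamma(2000, 36, 12), function(l) rpois(length(y), l)))

t_obs <- prop_zero(y)                     # observed proportion of zeros
t_rep <- apply(y_rep, 1, prop_zero)       # one value per replicated data set

# Posterior predictive p-value: how often the replications are
# at least as zero-heavy as the observed data
mean(t_rep >= t_obs)
```

With bayesplot, the analogous picture can be drawn with ppc_stat(y, y_rep, stat = "prop_zero") once `prop_zero` is defined.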

We can plot the means and variances for all of the simulated datasets, along with the observed statistics.
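A base-R sketch of this check, again with stand-in data (bayesplot's ppc_stat_2d with stat = c("mean", "var") likely produces the lab's version of the same plot):

```r
set.seed(360)
y <- c(4, 3, 0, 1, 5, 4, 0, 2, 7, 0)   # hypothetical stand-in data
y_rep <- t(sapply(rgamma(2000, 36, 12), function(l) rpois(length(y), l)))

rep_means <- apply(y_rep, 1, mean)      # one mean per replicated data set
rep_vars  <- apply(y_rep, 1, var)       # one variance per replicated data set

plot(rep_means, rep_vars, pch = ".", xlab = "mean", ylab = "variance")
points(mean(y), var(y), col = "red", pch = 19)   # observed statistics
```

For a Poisson model the simulated means and variances cluster around the same value; zero-heavy data typically show a variance well above the mean.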


Based on these PPCs, does this model appear to be a good fit for the data?

Poisson Hurdle Model

How should we account for the high frequency of zeros? As it turns out, there is a sizable literature on modeling this sort of data. One option is a zero-inflated Poisson model, in which the data are modeled as a mixture of a Poisson distribution and a point mass at zero. Here, we instead fit a hurdle model: \(y_i = 0\) with probability \(\theta\), and \(y_i > 0\) with probability \(1-\theta\). Conditional on \(y_i\) being nonzero, we model \(y_i\) with a Poisson distribution truncated from below at 1. For the rate of the truncated Poisson, we use the same \(Gamma(10,2)\) prior as in the simple Poisson model above.
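To see the generative story concretely, here is a sketch of simulating from a Poisson hurdle model in base R (the inverse-CDF trick draws from a Poisson truncated below at 1; the parameter values are taken from the posterior summary below):

```r
# Simulate n draws from the hurdle model:
# y = 0 with probability theta; otherwise y ~ Poisson(lambda) truncated to y >= 1
r_hurdle <- function(n, theta, lambda) {
  zero <- rbinom(n, 1, theta)
  # Restricting the uniform to (P(Y = 0), 1) before applying the
  # Poisson quantile function guarantees draws of at least 1
  pos <- qpois(runif(n, min = dpois(0, lambda), max = 1), lambda)
  ifelse(zero == 1, 0, pos)
}

set.seed(360)
y_sim <- r_hurdle(10000, theta = 0.17, lambda = 4.77)
mean(y_sim == 0)   # close to theta
```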

## Warning: There were 8 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Inference for Stan model: poisson-hurdle.
## 2 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=2000.
## 
##        mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## lambda 4.77    0.01 0.28 4.22 4.58 4.76 4.95  5.33  1449    1
## theta  0.17    0.00 0.04 0.10 0.14 0.17 0.20  0.27  1632    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Jul 12 12:30:17 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Here we compare the posterior densities of \(\lambda\) from the simple Poisson model and the Poisson hurdle model. Please finish the code:
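One possible completion, sketched with stand-in draws (normal approximations built from the two posterior summaries printed above; in the lab, the draws come from the two stanfit objects):

```r
set.seed(360)
# Stand-in posterior draws matching the printed summaries
lambda_pois   <- rnorm(2000, mean = 4.00, sd = 0.23)  # simple Poisson model
lambda_hurdle <- rnorm(2000, mean = 4.77, sd = 0.28)  # hurdle model

d1 <- density(lambda_pois)
d2 <- density(lambda_hurdle)
plot(d1, xlim = range(d1$x, d2$x), main = "Posterior of lambda")
lines(d2, lty = 2)
legend("topright", legend = c("Poisson", "hurdle"), lty = c(1, 2))
```

Note that \(\lambda\) means different things in the two models: in the hurdle model it is the rate of the truncated Poisson for the nonzero counts, so it is naturally larger.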

Obtaining posterior samples from the hurdle model is more complicated, so the Stan file includes code to draw from the posterior predictive distribution for you. We extract them here, and store the predictive samples in `y_rep2`.

Plot the remaining PPCs, and comment on how this second model compares to both the observed data and to the simple Poisson model.


Leave-one-out cross-validation

Here, we assess the predictive performance of both models with leave-one-out cross-validation (LOOCV). LOOCV holds out a single observation from the sample, fits the model on the remaining observations, uses the held-out data point for validation, and calculates the prediction error. This process is repeated \(n\) times, so that each observation is held out exactly once. We make use of the loo() function, and compare the two models:

## 
## Computed from 2000 by 73 log-likelihood matrix
## 
##          Estimate   SE
## elpd_loo   -203.1 14.7
## p_loo         2.5  0.5
## looic       406.2 29.3
## ------
## Monte Carlo SE of elpd_loo is 0.1.
## 
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
## 
## Computed from 2000 by 73 log-likelihood matrix
## 
##          Estimate   SE
## elpd_loo   -183.3 10.8
## p_loo         2.8  0.5
## looic       366.6 21.7
## ------
## Monte Carlo SE of elpd_loo is 0.0.
## 
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
## elpd_diff        se 
##      19.8       8.5
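A quick way to read the comparison: the reported difference in expected log predictive density (hurdle minus simple Poisson) is more than two standard errors from zero, which is commonly taken as meaningful evidence for the higher-elpd model. Using the numbers printed above:

```r
elpd_diff <- 19.8   # hurdle minus simple Poisson, from the output above
se_diff   <- 8.5

# Rule of thumb: a difference of roughly two or more standard
# errors favors the model with the higher elpd
elpd_diff / se_diff
```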

Which model performs better in terms of prediction?

Wrapping up

Some final questions to consider:

  1. Why are PPCs important?
  2. Was the second model a good fit for the data? Why or why not?
  3. If someone reported a single LOOCV error to you, would that be useful? Why or why not?

*Adapted from tutorial

STA 360: Bayesian Inference and Modern Statistical Methods

12 July, 2019